Method for producing a document summary

ABSTRACT

A method for producing a document summary from a document. The method includes: 
     associating with the document a specific category from a set of predetermined categories; 
     performing a thematic segmentation of the document to produce a segmented document, the segmented document including a plurality of text segments; 
     associating with each text segment from the plurality of text segments a theme selected from a set of predetermined themes; and 
     summarizing the segmented document to produce the document summary by processing each text segment from the plurality of text segments to either
         select at least one summary textual unit from the text segment, the at least one summary textual unit including at least one word and being a textual unit considered important in summarizing the document; or
         extract no textual unit from the text segment.
 
The summary textual units are used to form the document summary. The thematic segmentation is dependent on the category to which the document is associated and the summary textual units are selected for each text segment depending on the theme with which the text segment is associated.

FIELD OF THE INVENTION

The present invention relates generally to the field of automated text processing and is particularly concerned with a method for producing a document summary from a document.

BACKGROUND OF THE INVENTION

Significant advances made in information processing technologies in the last few decades have led to the production of relatively large quantities of data. Due to the efficiency with which this data may be processed using information technologies, people often expect that this data be used efficiently by professionals working in many fields.

A specific field in which information is produced in large quantities and in which information needs to be adequately classified and reliably accessed is the legal field. Indeed, legal experts perform relatively difficult legal clerical work which requires accuracy and speed. These legal experts often summarize legal documents, such as judgments, and look for information relevant to specific cases in these summaries. These tasks involve understanding, interpreting, explaining and researching a wide variety of legal documents. A summary of a judgment, as a compressed but hopefully accurate statement of its contents, helps in organizing a large volume of documents and in finding the relevant judgments for a specific case.

For this reason, judgments are frequently manually summarized by legal experts. However, the human time and expertise required to provide manual summaries for legal research make human-generated summaries relatively expensive. Also, there is always a risk that a legal expert misinterprets a judgment and, therefore, classifies it in a wrong class by mistake or produces an erroneous summary.

Because of the relatively high accuracy required in the classification and summarization of judgments, commonly available automated classification and summarization methods are typically not suitable for this task.

Accordingly, there exists a need for an improved method for producing a document summary from a document. It is a general object of the present invention to provide such an improved method.

SUMMARY OF THE INVENTION

In a first broad aspect, the invention provides a method for producing a document summary from a document, the document including a plurality of words and being segmentable into a plurality of text segments, each text segment including at least one word, the document being classifiable as belonging to a category selected from a set of predetermined categories and each text segment being classifiable as belonging to a theme selected from a set of predetermined themes. The method includes:

-   associating with the document a specific category from the set of predetermined categories;
-   performing a thematic segmentation of the document to produce a segmented document, the segmented document including the plurality of text segments;
-   associating with each text segment from the plurality of text segments a theme selected from the set of predetermined themes; and
-   summarizing the segmented document to produce the document summary by processing each text segment from the plurality of text segments to either
    -   select at least one summary textual unit from the text segment, the at least one summary textual unit including at least one of the words, the at least one summary textual unit being a textual unit considered important in summarizing the document; or
    -   extract no textual unit from the text segment;

the summary textual units being used to form the document summary.

The thematic segmentation is dependent on the category with which the document is associated and the summary textual units are selected for each text segment depending on the theme with which the text segment is associated.

These dependencies have a synergetic effect that results in an unexpectedly high accuracy of the document summary.

For more clarity, for the purpose of this document, textual units are words or groups of words that have a specific meaning. For example, in the expression “Second World War”, the combination of the words “second”, “world” and “war” produces an expression that has by itself a specific meaning. In other words, a textual unit relates to a concept and one or more words are used to express this concept. In some embodiments of the invention, some textual units are whole sentences or whole paragraphs, among other possibilities.

Also, in some embodiments of the invention, the document summary includes a summary of the document in the commonly accepted definition of a comprehensive and usually brief recapitulation of the document. However, in alternative embodiments of the invention, the document summary organizes the information contained in the document in any other manner to summarize the document. For example, and non-limitingly, this information may be organized in table form.

Advantageously, the proposed method is relatively efficient, relatively fast and relatively reliable in summarizing certain categories of documents such as, for example, and non-limitingly, legal documents and more specifically judgments.

The proposed method is also relatively easily implemented using commonly used programming languages and is of an efficiency such that it is practical to execute this method on currently available computer hardware.

In addition to producing an accurate document summary from the document, the proposed method also allows judgments to be classified into a specific category from the set of predetermined categories. Therefore, classification, which is often paramount in retrieving information in the legal field, is automatically performed by the proposed method without requiring any additional step.

In some embodiments of the invention, the proposed method is able to process documents in more than one language. This is implemented by first producing the summary of the document in the language in which the document is written. Afterwards, the document summary is translated into at least one other language. Subsequently, the document summary may be searched using queries in any of these languages. Therefore, the proposed method allows documents in many languages to be processed relatively efficiently, as is needed in jurisdictions having more than one official language.

In a variant, the document is associated with the specific category using statistical methods, heuristic methods, or a combination of both heuristic and statistical methods.

In some embodiments of the invention, the thematic segmentation is performed paragraph by paragraph in the document. However, in alternative embodiments of the invention, the thematic segmentation is performed in any other suitable manner.

In a variant, the thematic segmentation is performed by using statistical methods, heuristic methods or a combination of statistical and heuristic methods, among other possibilities.

By using a priori knowledge concerning the structure of the document, which is embedded into the statistical and heuristic methods used in categorizing, segmenting and summarizing the document, relatively complex documents may be relatively easily and accurately classified and summarized.

In the proposed method, the segmentation is dependent upon the category in which the document is classified. Also, the extraction of significant sentences or portions of sentences from the document to produce a document summary is dependent on the theme associated with each text segment. Therefore, prior to being summarized, the document is processed to establish a context in which the summarization occurs, which improves the accuracy of the document summary. This manner of organizing the segmentation and summarization of the document allows relatively good summaries to be produced without human intervention.

In another broad aspect, the invention provides a computer readable storage medium containing a program element for execution by a computing device, the program element being able to produce a document summary from a document.

Other objects, advantages and features of the present invention will become more apparent upon reading of the following non-restrictive description of preferred embodiments thereof, given by way of example only with reference to the accompanying drawings.

BRIEF DESCRIPTION OF THE DRAWINGS

An embodiment of the present invention will now be disclosed, by way of example, in reference to the following drawings in which:

FIG. 1, in a schematic view, illustrates a computing device for executing a program element implementing a method for producing a document summary from a document in accordance with an embodiment of the present invention;

FIG. 2, in a schematic view, illustrates an example of a structure of a document summarizable by the method executable on the computing device of FIG. 1;

FIG. 3, in a schematic view, illustrates a method for producing a document summary from a document, the document being shown in FIG. 2 and the method being executable by a program element running on the computing device of FIG. 1; and

FIG. 4, in a schematic view, illustrates the program element implementing the method of FIG. 3, the program element being executable by the computing device of FIG. 1.

DETAILED DESCRIPTION

FIG. 1 is a block diagram of an apparatus for producing a document summary from a document in the form of a computing device 12. The computing device 12 includes a Central Processing Unit (CPU) 22 connected to a storage medium 24 over a data bus 26. Although the storage medium 24 is shown as a single block, it may include a plurality of separate components, such as a floppy disk drive, a fixed disk, a tape drive and a Random Access Memory (RAM), among others. The computing device 12 also includes an Input/Output (I/O) interface 28 that connects to the data bus 26. The computing device 12 communicates with outside entities through the I/O interface 28. In a non-limiting example of implementation, the I/O interface 28 is a network interface.

The computing device 12 also includes an output device 30 to communicate information to a user. In the example shown, the output device 30 includes a display. Optionally, the output device 30 includes a printer or a loudspeaker, among other suitable output device components. The computing device 12 further includes an input device 32 through which the user may input data or control the operation of a program element executed by the CPU 22. The input device 32 may include, for example, any one or a combination of the following: keyboard, pointing device, touch sensitive surface or speech recognition unit, among others.

When the computing device 12 is in use, the storage medium 24 holds a program element 300 (seen in FIG. 4) executed by the CPU 22, the program element 300 implementing a method for producing a document summary from a document.

An example of such a method is illustrated in FIG. 3 and generally designated by the reference numeral 200. FIG. 2 illustrates an example of a document 100 that may be summarized using the method 200. For example, the document 100 is a legal document such as a court judgment.

The document 100 includes sections 105 a, 105 b and 105 c. Each of the sections 105 a, 105 b and 105 c includes a section heading and paragraphs. For example, as seen in FIG. 2, the section 105 a includes a section heading 110 and two paragraphs 115 a and 115 b. In turn, each of the paragraphs 115 a and 115 b includes sentences. For example, the paragraph 115 b includes four sentences, namely sentences 120 a, 120 b, 120 c and 120 d. Finally, each of the sentences 120 a, 120 b, 120 c and 120 d includes words such as, for example, words 125 a, 125 b, 125 c, 125 d and 125 e of the sentence 120 d. The reader skilled in the art will readily appreciate that the document 100 illustrated in FIG. 2 is shown for example purposes only and that the method 200 may be used to summarize any suitable document.

The document 100 is segmentable into a plurality of text segments. Each text segment includes at least one of the words. Also, the document 100 is classifiable as belonging to a category selected from a set of predetermined categories and each text segment is classifiable as belonging to a theme selected from a set of predetermined themes.

Generally speaking, the method 200 involves the use of a priori information regarding the structure of the document 100. This a priori information is used to produce the document summary.

More specifically, the method 200 starts at step 205. At step 210, the document 100 is associated with a specific category from a set of predetermined categories. At step 215, the document is segmented and, afterwards, at step 220, the document is summarized. Finally, the method ends at step 225. The segmentation performed at step 215 is a thematic segmentation and is dependent on the category with which the document is associated. Also, step 220 of summarizing the document is performed segment-by-segment, and textual units, such as for example paragraphs, sentences or words, from each segment are selected for inclusion into the summary depending on the theme with which the text segment is associated. The a priori information regarding the document is embedded into the specific manner in which the document is categorized, segmented and summarized.
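The order of steps 210, 215 and 220 and their dependencies may be sketched as follows. This is a minimal illustration only, not the disclosed implementation: the function name `summarize_document`, the `Segment` alias and the three callables passed as arguments are hypothetical and are introduced solely to show that segmentation receives the category and that summarization receives the theme of each segment.

```python
from typing import Callable, List, Tuple

# Hypothetical representation of a text segment: (theme, text).
Segment = Tuple[str, str]


def summarize_document(
    document: str,
    categorize: Callable[[str], str],
    segment: Callable[[str, str], List[Segment]],
    summarize_segment: Callable[[str, str], List[str]],
) -> Tuple[str, List[str]]:
    """Run steps 210, 215 and 220 in order; return (category, summary units)."""
    # Step 210: associate the document with a category.
    category = categorize(document)
    # Step 215: thematic segmentation, which depends on the category.
    segments = segment(document, category)
    # Step 220: segment-by-segment summarization, which depends on the theme.
    summary: List[str] = []
    for theme, text in segments:
        # A segment may contribute no textual unit at all.
        summary.extend(summarize_segment(theme, text))
    return category, summary
```

Passing the three stages as callables keeps the sketch independent of any particular statistical or heuristic technique used within each step.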

By using this a priori information, it is possible to produce accurate summaries of a wide variety of documents belonging to a general document type such as, for example, court judgments. The reader skilled in the art will readily appreciate that while examples given herein regarding the method 200 refer to a court judgment, the proposed method is applicable to any other suitable documents.

At step 210, a specific category from the set of predetermined categories is associated with the document 100. For example, in the case of a judgment, the predetermined category associated with a specific document may be “immigration case relating to acceptance or refusal of the grant of a refugee status”. In some embodiments of the invention, the predetermined categories are organized according to a hierarchy, such as is often the case in many fields such as, for example, in the legal field. Typically, but in no manner exclusively, the predetermined categories are categories that are commonly used in the field to which the document 100 relates.

While any suitable method may be used to categorize the document 100 into a specific category, it has been found that a combination of heuristic rules and statistical methods allows legal documents to be classified relatively effectively. More specifically, in a specific embodiment of the invention, associating the document 100 with a specific category includes computing for each category from the set of predetermined categories a respective document categorization score indicative of a likelihood that the document is classifiable in that category. The document categorization score is computed from the document.

The specific category to be associated with the document 100 is a category from the set of predetermined categories for which a document categorization score associated therewith is maximal. In a specific embodiment of the invention, computing the document categorization scores includes computing a categorization statistical score by computing a document statistic of the document 100 and comparing the document statistic with a set of predetermined statistics, each predetermined statistic being associated with a respective predetermined category from the set of predetermined categories.

The predetermined statistics are representative of documents classifiable in the respective predetermined categories with which they are associated. In other words, the statistics of the document 100 are compared to predetermined statistics that are known to represent text classifiable in the predetermined categories. For example, the predetermined statistics have been obtained by computing the statistic for documents that have been manually classified by a human. In some embodiments, once these predetermined statistics have been computed for a sample, they are used without any change to classify new documents. In other embodiments of the invention, when an error is detected in the classification made by the method 200, the predetermined statistics are updated according to a correct classification of the document 100 determined by a human user. An example of a suitable statistic usable with the method 200 is a document statistic obtained using a support vector machine method. This method is well known in the art and will therefore not be described in further detail.

In addition to using statistical methods, the categorization performed at step 210 may also use a set of predetermined heuristic rules to compute a document heuristic score. More specifically, the document categorization score may be computed by applying a set of predetermined categorization rules to the document 100. Each predetermined categorization rule, when applied to the document, results in the computation of a respective categorization rule score. The categorization rule scores are combined with each other to obtain a document categorization score.

For example, judgments including the following expressions: “infringement”, “injunctions”, “licensee” and “assessment of costs” are likely to be related to intellectual property. Therefore, the presence of these expressions in a document 100 increases a document categorization score for classification in an intellectual property category. Also, judgments that are known to be related to intellectual property and that include the following expressions: “patent(s)”, “NOC”, “Notice of Compliance”, “Notice of Application” and “Minister of Health” are likely to be related to patents. Therefore, the presence of these expressions in a document 100 increases a document categorization score for classification in an intellectual property/patent category, which is a subcategory of an intellectual property category.

In a variant, a number, which may be positive or negative, is obtained by applying each rule to the document 100. For example, the presence of certain words may raise the document categorization score associated with a certain category but lower the categorization score associated with another category. The categorization rule scores are afterwards combined, possibly with the categorization statistical score, to obtain for each of the predetermined categories a document categorization score representing the likelihood that the document 100 belongs to that category. Afterwards, selecting the highest document categorization score determines the category into which the document should be classified.
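The combination of signed rule scores with a statistical score, followed by selection of the maximal score, may be sketched as follows. This is an illustration under stated assumptions: the rule table `RULES`, its numeric values, the additive combination and the `weight` parameter are hypothetical, and the statistical scores are taken as given inputs rather than computed by a support vector machine.

```python
from typing import Dict, List, Tuple

# Hypothetical rules: (expression, category, signed score). Values are illustrative.
RULES: List[Tuple[str, str, float]] = [
    ("infringement", "intellectual property", 1.0),
    ("licensee", "intellectual property", 1.0),
    ("notice of compliance", "intellectual property/patent", 2.0),
    ("refugee", "immigration", 2.0),
    ("refugee", "intellectual property", -1.0),  # a rule may lower a score
]


def heuristic_scores(text: str, categories: List[str]) -> Dict[str, float]:
    """Sum the scores of all rules whose expression occurs in the text."""
    lowered = text.lower()
    scores = {category: 0.0 for category in categories}
    for expression, category, delta in RULES:
        if category in scores and expression in lowered:
            scores[category] += delta
    return scores


def categorize(
    text: str, statistical_scores: Dict[str, float], weight: float = 1.0
) -> Tuple[str, Dict[str, float]]:
    """Combine statistical and heuristic scores; return the best category."""
    categories = list(statistical_scores)
    rules = heuristic_scores(text, categories)
    combined = {c: statistical_scores[c] + weight * rules[c] for c in categories}
    return max(combined, key=combined.get), combined
```

Simple addition is only one possible way of combining the scores; the text above leaves the combination method open.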

At step 215, the document 100 is divided into a plurality of text segments. In some embodiments of the invention, the text segments correspond to sections 105 a, 105 b and 105 c or to paragraphs 115 a and 115 b. In yet other embodiments of the invention, the text segments correspond to sentences 120 a to 120 d or to words 125 a to 125 e. In yet other embodiments of the invention, the text segments correspond to any other suitable segments of the document 100. In a specific embodiment of the invention that has been found to be particularly suitable for the summarization of judgments, the text segments include contiguous paragraphs belonging to the same theme.

For example, in the context of court judgment categorization, these themes may include the themes “decision data”, which includes the reference for the judgment and information related to the parties involved, “introduction”, which states the persons involved in the judgment and the subject matter to be resolved, “context”, which states the facts and events that led to a lawsuit being filed, “submission”, which presents the arguments of each party relating to each issue, “issues”, which identifies the questions of law addressed by the court, “judicial analysis”, which states the reasoning and jurisprudence used by the judge to arrive at a conclusion, and “conclusion”, which expresses the final decision of the court.

It should be noted that in this specific example, not all segments are necessarily used during the summarization step of the method 200. For example, the “submission” theme is relatively unimportant in some contexts and may therefore be completely ignored at the summarization step. However, segmenting this theme separately from the other themes allows the text that is ignored at the summarization step to be distinguished relatively easily.

Also, in this example, another theme that is particularly useful is the “issues” theme. Indeed, once the issues have been identified, looking for the sections of text that address these issues at the summarization step is facilitated. For example, it is expected that all the issues identified should be addressed in the document 100, which helps in producing an accurate document summary by implementing the summarization step such that as many issues are included in the summary as the number of issues found in the “issues” theme.

In a variant, associating each text segment from the plurality of text segments with one of the themes selected from the set of predetermined themes includes computing for each text segment from the plurality of text segments a set of segment categorization scores. Each segment categorization score from the set of segment categorization scores is associated with a respective theme from the set of predetermined themes and is indicative of the likelihood that the text segment is classifiable in the theme. In these embodiments, each text segment is associated with a theme from the set of predetermined themes for which the segment categorization score associated therewith is maximal.

In some embodiments of the invention, computing the segment categorization score includes computing a segment statistic of the text segment and comparing the segment statistic with a set of predetermined segment statistics. The predetermined segment statistics are each associated with a respective predetermined theme from the set of predetermined themes and are representative of segments that are classified in their respective predetermined themes for documents classified in the specific category into which the document 100 is classified. The predetermined segment statistics are obtained from documents that have been manually segmented by humans and for which the statistic has been computed. The predetermined segment statistics may be computed and fixed or otherwise iteratively corrected when the method 200 is applied to many documents.

For example, the segment statistics depend on at least one factor selected from: a section in which the paragraph included in the text segment is found, a position of the paragraph in the document, a presence of a predetermined group of words in the paragraph, and linguistic information derived from words included in the paragraph.

Also, heuristic rules may also be involved to produce scores that may be combined with the computed statistics to segment the document, in a manner similar to the manner in which categorization scores are computed to classify the document 100. For example, these heuristic rules may include rules regarding the position of paragraphs in the document 100 or theme, linguistic rules and rules based on specific knowledge of the field to which the document 100 relates.
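The assignment of each paragraph to the theme with the maximal segment categorization score, and the grouping of contiguous paragraphs sharing a theme into one text segment, may be sketched as follows. The function names and the representation of scores as dictionaries are assumptions made for illustration; how the scores themselves are computed is left outside the sketch.

```python
from itertools import groupby
from typing import Dict, List, Tuple


def assign_theme(scores: Dict[str, float]) -> str:
    """Return the theme whose segment categorization score is maximal."""
    return max(scores, key=scores.get)


def segment_by_theme(
    paragraphs: List[str], paragraph_scores: List[Dict[str, float]]
) -> List[Tuple[str, List[str]]]:
    """Merge contiguous paragraphs that share the same theme into segments."""
    themed = [
        (assign_theme(scores), paragraph)
        for paragraph, scores in zip(paragraphs, paragraph_scores)
    ]
    # groupby only merges adjacent items, which is the contiguity requirement.
    return [
        (theme, [paragraph for _, paragraph in group])
        for theme, group in groupby(themed, key=lambda item: item[0])
    ]
```

Because only adjacent paragraphs are merged, a theme that reappears later in the document yields a second, separate segment.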

At step 220, the segmented document 100 is summarized. For example, the document summary may be produced by selecting sentences from the document 100 to be included in the document summary. To this effect, in some embodiments of the invention, a respective sentence score indicative of a likelihood that a sentence is important in summarizing the document is computed for each sentence in the document, and the sentences having the highest sentence scores are selected for inclusion in the summary.

For example, computing the sentence scores includes computing a sentence statistic of each of the sentences of the document. For example, the sentence statistic depends on at least one factor selected from: the position of the sentence in the document, a position of a paragraph in which a sentence is included in the section in which the paragraph is included, a frequency of words or textual units included in the sentence compared to a frequency with which the words or textual units are included in the document, an expected frequency with which the words or textual units included in the sentence are expected to be included in documents categorized in the specific category and in themes associated with the paragraph in which the sentence is included, among other possibilities.

Also, in some embodiments of the invention, computing the sentence score includes computing a heuristic sentence score from the sentence by applying a set of predetermined heuristic sentence rules to the sentence, each heuristic sentence rule producing a respective sentence rule score. Afterwards, the sentence rule scores are combined to obtain the heuristic sentence score, for example by adding the sentence rule scores to each other.

A non-limiting example of a sentence rule is as follows. If the document 100 is known to be in an Immigration/Refugee/Abandonment category, and a “context” theme is summarized, sentences including the following textual units increase the sentence score of sentences in which they are found: “Abandon . . . claim”, “Claim/application . . . abandoned”, “Abandonment . . . hearing”.
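The sentence rule above may be sketched with regular expressions as follows. Several points are assumptions made for illustration only: the reading of “. . .” as any intervening text within the sentence, the reading of “/” as an alternative, the matching of inflected forms of “abandon”, the rule table `SENTENCE_RULES` and the uniform `rule_score` of 1.0 per matching rule.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Hypothetical rule table keyed by (category, theme).
SENTENCE_RULES: Dict[Tuple[str, str], List[Pattern[str]]] = {
    ("Immigration/Refugee/Abandonment", "context"): [
        # "Abandon . . . claim" (inflected forms of "abandon" also accepted)
        re.compile(r"\babandon\w*\b.*\bclaim\b", re.IGNORECASE),
        # "Claim/application . . . abandoned"
        re.compile(r"\b(?:claim|application)\b.*\babandoned\b", re.IGNORECASE),
        # "Abandonment . . . hearing"
        re.compile(r"\babandonment\b.*\bhearing\b", re.IGNORECASE),
    ],
}


def heuristic_sentence_score(
    sentence: str, category: str, theme: str, rule_score: float = 1.0
) -> float:
    """Add rule_score for every rule of (category, theme) matching the sentence."""
    patterns = SENTENCE_RULES.get((category, theme), [])
    return sum(rule_score for pattern in patterns if pattern.search(sentence))
```

Keying the rules by both category and theme mirrors the dependency described above: the same sentence scores nothing when it appears in a segment of another theme.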

Finally, the heuristic sentence score and the sentence statistic are combined to obtain a sentence score, which is used to select sentences for inclusion into the summary. In some embodiments of the invention, the document is summarized by including sentences having a score higher than a threshold score. For example, the threshold score is a predetermined score. In alternative embodiments of the invention, the threshold score is adjusted on a document-by-document basis so that the document summary has a length that is smaller than a predetermined size, as measured using any suitable document length measurement.

For example, the predetermined size is a fixed percentage of the size of the document to be summarized. It has been found that a percentage of from about 5 to about 15 percent, and in some embodiments about 10 percent, gives good results in summarizing legal documents, such as judgments. In other embodiments of the invention, the document summary has a predetermined size, such as for example a size allowing the document summary to be printed in a predetermined font onto a single page.
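Adjusting the threshold on a document-by-document basis so that the summary stays within a fixed fraction of the document may be sketched as follows. The sketch assumes that length is measured in whitespace-separated words and that the sentence scores are already available; the function name `select_sentences` and its parameters are hypothetical. Lowering the threshold one sentence at a time is equivalent to taking sentences in decreasing score order until the size limit would be exceeded.

```python
from typing import List


def select_sentences(
    sentences: List[str], scores: List[float], max_fraction: float = 0.10
) -> List[str]:
    """Keep the highest-scoring sentences within max_fraction of the word count."""
    budget = max_fraction * sum(len(s.split()) for s in sentences)
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen: List[int] = []
    used = 0
    for index in ranked:
        length = len(sentences[index].split())
        if used + length > budget:
            break  # lowering the threshold further would exceed the size limit
        chosen.append(index)
        used += length
    # Return the selected sentences in their original document order.
    return [sentences[i] for i in sorted(chosen)]
```

The default of 0.10 corresponds to the figure of about 10 percent mentioned above; any other suitable length measurement could replace the word count.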

In some embodiments of the invention, threshold scores are selected individually for each of the predetermined themes so that sentences selected to be part of the document summary for each theme represent a predetermined fraction of the document summary. For example, it has been found that the following distribution of the length of each theme within the summary provides advantageously concise and accurate summaries: Introduction: 10% of summary; Context: 25% of summary; Juridical Analysis: 60% of summary; and Conclusion: 5% of summary.
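The per-theme distribution above translates directly into a length budget for each theme. The following sketch assumes that the summary length is expressed in words and rounds each budget to a whole number; the names `THEME_SHARES` and `theme_word_budgets` are hypothetical.

```python
from typing import Dict

# Shares of the summary allotted to each theme, as stated in the text.
THEME_SHARES: Dict[str, float] = {
    "Introduction": 0.10,
    "Context": 0.25,
    "Juridical Analysis": 0.60,
    "Conclusion": 0.05,
}


def theme_word_budgets(summary_words: int) -> Dict[str, int]:
    """Split a total summary length (in words) into per-theme budgets."""
    return {
        theme: round(summary_words * share)
        for theme, share in THEME_SHARES.items()
    }
```

Each per-theme threshold score can then be lowered until the sentences selected for that theme fill, without exceeding, the corresponding budget.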

In some embodiments of the invention, the step 220 of summarizing the document includes filtering the document 100 to remove words satisfying a predetermined word rejection criterion prior to computing the sentence scores. For example, quotations of other judgments are typically relatively unimportant in producing summaries as they merely repeat extracts from other judgments. Therefore, formatting and linguistic information may be used to form filtering rules that automatically recognize such quotations.
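A filtering rule based on formatting may be sketched as follows. The two cues used here, leading indentation and a paragraph entirely enclosed in quotation marks, are assumptions chosen for illustration; the text above does not specify which formatting or linguistic cues are used.

```python
import re
from typing import List

# Hypothetical cue: the whole paragraph is wrapped in quotation marks.
QUOTE_PATTERN = re.compile(r'^\s*["“].*["”]\s*$', re.DOTALL)


def is_quotation(paragraph: str) -> bool:
    """Guess whether a paragraph is a quotation from its formatting."""
    indented = paragraph.startswith(("    ", "\t"))  # hypothetical cue
    return indented or bool(QUOTE_PATTERN.match(paragraph))


def filter_quotations(paragraphs: List[str]) -> List[str]:
    """Drop paragraphs recognized as quotations before sentence scoring."""
    return [p for p in paragraphs if not is_quotation(p)]
```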

In some embodiments of the invention, the document summary is translated into a language different from the language in which it has been produced. For example, the translation may be performed using translation rules that are dependent on the specific category into which the document 100 is classified. Also, the translation rules may depend on the specific themes in which each sentence present in the document summary has been classified previously. Also, in some embodiments of the invention, the program element 300 is able to process documents written in more than one language, such that the summarization process occurs in the language in which the document has been written.

In some embodiments of the invention, the document summary is generated only by summarizing segments classified as introductory segments. For example, the introduction segment is summarized by removing secondary information from this introduction segment, such as for example and non-limitingly, dates, names of parties, information between parentheses or brackets, and subordinate clauses. In alternative embodiments of the invention, the document summary is generated by searching for predetermined expressions in the segmented document and extracting sentences including these expressions to form the document summary. For example, at least some of these expressions are associated with at least one of the themes. It is also within the scope of the invention to combine any number of the above-described summarization methods to produce the document summary. In yet other embodiments of the invention, the specific category with which the document 100 is associated may influence the segments used to produce the document summary. For example, in an immigration judgment, there is typically an error of law that the judgment addresses. This information is relatively important and may therefore be searched for in the document 100 for inclusion in the document summary.

FIG. 4 illustrates a program element 300 implementing the method 200.The program 300 includes an input module 310 for receiving the document100. In some embodiments of the invention, the input module 310 performsa language recognition to recognize the language in which the document100 is written. The input module 310 then transfers the document 100 toa categorization module 315 that broadly implements step 205 ofcategorizing the document 100. The categorized document is then sent toa segmenting module 320 that broadly segments the document as describedhereinabove with respect to step 215. Afterwards, the segmented documentis sent to a summarization module 325 that summarizes the document 100according to the method detailed hereinabove with respect to step 220.Finally, the program element 300 includes an output module 330 foroutputting the document summary.

In some embodiments of the invention, the document summary is added to asummary database 335 of document summaries. In some embodiments of theinvention, the output module also translates the document summary in oneor more languages different from the language in which the document 100is written. In these embodiments, the document summaries are stored inmultiple copies in the summary database, each copy corresponding to adifferent language. In these embodiments, each of the documentsummaries, for example document summaries 1 and 2 336A and 337A are eachassociated with a respective translated document summary 1 and 2 336Band 337B.

The summary database 335 is searchable using a search engine 340. For example, the search engine 340 is operative for searching the summary database 335 in all the languages in which the output module 330 outputs document summaries. Therefore, documents that were originally in any of these languages may be searched using any specific one of the languages. This approach typically produces better search results than conventional search engines that would translate a query into many languages prior to doing the search. Indeed, the output module 330 uses a priori knowledge concerning the document 100 to translate the summaries, such as for example the category into which the document 100 is classified. This typically allows more accurate translated document summaries to be produced than would be possible without using this approach.
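As an illustration of searching summaries stored in several languages, the following Python sketch keeps one summary per language for each document and matches a query only against the copies written in the query language. The class, its methods and the substring matching are assumptions made for this example; the description does not specify how the summary database 335 or the search engine 340 is implemented.

```python
from collections import defaultdict


class SummaryDatabase:
    """Toy store holding, for each document, one summary per language."""

    def __init__(self):
        self._summaries = defaultdict(dict)  # doc_id -> {language: summary}
        self._original_language = {}         # doc_id -> language of the document

    def add(self, doc_id, original_language, summaries):
        """Store the original-language summary and its translated copies."""
        self._original_language[doc_id] = original_language
        self._summaries[doc_id].update(summaries)

    def search(self, query, language):
        """Return ids of documents whose summary in `language` has every term."""
        terms = query.lower().split()
        hits = []
        for doc_id, by_language in self._summaries.items():
            text = by_language.get(language, "").lower()
            if terms and all(term in text for term in terms):
                hits.append(doc_id)
        return sorted(hits)
```

Because the query is matched against summaries already translated when they were output, no query translation takes place at search time, and a document originally written in one language can be found with a query in another.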

Examples of specific manners of implementing details of the above-described method are found in the following documents, which are hereby incorporated by reference in their entirety:

-   Atefeh FARZINDAR, Frédérik ROZON and Guy LAPALME, "CATS, a topic-oriented multi-document summarization system", DUC 2005 Workshop, NIST, p. 8, Vancouver, October 2005.
-   Atefeh FARZINDAR, "Automatic summarization of legal texts", Ph.D. Thesis, University of Montreal and University of Paris IV-Sorbonne, March 2005.
-   Atefeh FARZINDAR and Guy LAPALME, "LetSUM, an automatic Legal Text Summarizing System", in Thomas F. Gordon (editor), Legal Knowledge and Information Systems, Jurix 2004: the Seventeenth Annual Conference, p. 11-18, IOS Press, Berlin, December 2004.
-   Atefeh FARZINDAR and Guy LAPALME, "LetSUM, a Text Summarization System in Law Field", THE FACE OF TEXT conference (Computer Assisted Text Analysis in the Humanities), p. 27-36, McMaster University, Hamilton, Ontario, Canada, November 2004.
-   Atefeh FARZINDAR and Guy LAPALME, "The use of thematic structure and concept identification for legal text summarization", Computational Linguistics in the North-East (CLiNE 2004), p. 67-71, Montréal, Québec, Canada, August 2004.
-   Atefeh FARZINDAR and Guy LAPALME, "Legal texts summarization by exploration of the thematic structures and argumentative roles", Text Summarization Branches Out Conference held in conjunction with ACL04, Barcelona, Spain, July 2004.
-   Atefeh FARZINDAR and Guy LAPALME, "Using Background Information for Multi-document Summarization and Summaries in Response to a Question", HLT-NAACL 2003 Workshop on Text Summarization, Edmonton, Canada.

Although the present invention has been described hereinabove by way of preferred embodiments thereof, it can be modified, without departing from the spirit and nature of the subject invention as defined in the appended claims.

1. A method for producing a document summary from a document, said document including a plurality of words and being segmentable into a plurality of text segments, each text segment including at least one word, said document being classifiable as belonging to a category selected from a set of predetermined categories and each text segment being classifiable as belonging to a theme selected from a set of predetermined themes, said method comprising: associating with said document a specific category from said set of predetermined categories; performing a thematic segmentation of said document to produce a segmented document, said segmented document including said plurality of text segments; associating with each text segment from said plurality of text segments a theme selected from said set of predetermined themes; and summarizing said segmented document to produce said document summary by processing each text segment from said plurality of text segments to either select at least one summary textual unit from said text segment, said at least one summary textual unit including at least one of said words, said at least one summary textual unit being a textual unit considered important in summarizing said document; or extract no textual unit from said text segment; said summary textual units being used to form said document summary; wherein said thematic segmentation is dependent on said category to which said document is associated and said summary textual units are selected for each text segment depending on said theme with which said text segment is associated.
 2. A method as defined in claim 1, wherein associating said document with a specific category includes computing for each category from said set of predetermined categories a respective document categorization score indicative of a likelihood that said document is classifiable in said category, said document categorization score being computed from said document, said specific category being a category from said set of predetermined categories for which said document categorization score associated therewith is maximal.
 3. A method as defined in claim 2, wherein computing said document categorization scores includes computing a document statistic of said document and comparing said document statistic with a set of predetermined statistics, each predetermined statistic being associated with a respective predetermined category from said set of predetermined categories and representative of documents that are classifiable in said respective predetermined category.
 4. A method as defined in claim 3, wherein said document statistic is obtained using a support vector machine method.
 5. A method as defined in claim 2, wherein computing said document categorization scores includes applying a set of predetermined categorization rules to said document, the application of each predetermined categorization rule to said document resulting in the computation of a respective categorization rule score; and combining said categorization rule scores to obtain said document categorization scores.
 6. A method as defined in claim 2, wherein computing said document categorization scores includes combining a statistical score and a heuristic score, each of said statistical and heuristic scores being computed from said document.
 7. A method as defined in claim 2, wherein said set of predetermined categories is a hierarchical set of categories.
 8. A method as defined in claim 1, further comprising dividing said document into said plurality of text segments.
 9. A method as defined in claim 8, wherein associating with each text segment from said plurality of text segments said theme selected from said set of predetermined themes includes computing for each text segment from said plurality of text segments a set of segment categorization scores, each segment categorization score from said set of segment categorization scores being associated with a respective theme from said set of predetermined themes and being indicative of a likelihood that said text segment is classifiable in said theme with which said segment categorization score is associated, each of said text segments being associated with a theme from said set of predetermined themes for which said segment categorization score associated therewith is maximal.
 10. A method as defined in claim 9, wherein computing said segment categorization scores includes computing a segment statistic of said text segment and comparing said segment statistic with a set of predetermined segment statistics, each predetermined segment statistic being associated with a respective predetermined theme from said set of predetermined themes and representative of segments that are classified in said respective predetermined theme for documents classified in said specific category.
 11. A method as defined in claim 10, wherein said document includes at least one section identified by a section heading present in said document, each of said sections including at least one paragraph, each of said paragraphs including at least one sentence, each of said sentences including at least one word; each of said text segments includes at least one paragraph; each of said segment statistics depends on at least one factor from the set consisting of: a section in which said at least one paragraph is included, a position of said at least one paragraph in said document, a presence of a predetermined group of words in said at least one paragraph, and linguistic information derived from words included in said at least one paragraph included in said text segment.
 12. A method as defined in claim 1, wherein said document includes at least one section identified by a section heading present in said document, each of said sections including at least one paragraph, each of said paragraphs including at least one sentence, each of said sentences including at least one word; summarizing said segmented document to produce said document summary includes computing for each sentence of said document a respective sentence score indicative of a likelihood that said sentence is important in summarizing said document.
 13. A method as defined in claim 12, wherein computing said sentence scores for each sentence includes computing a sentence statistic of said sentence.
 14. A method as defined in claim 13, wherein said sentence statistic depends on at least one factor selected from the set consisting of: a position of said sentence in said document, a position of a paragraph in which said sentence is included in said section in which said paragraph is included; a frequency of words included in said sentence as compared with a frequency with which said words are included in said document, an expected frequency with which said words included in said sentence are expected to be included in documents categorized in said specific category and in themes associated with said paragraph in which said sentence is included, a frequency of textual units included in said sentence as compared with a frequency with which said textual units are included in said document, and an expected frequency with which textual units included in said sentence are expected to be included in documents categorized in said specific category and in themes associated with said paragraph in which said sentence is included.
 15. A method as defined in claim 14, wherein computing said sentence score includes, for each sentence, computing a heuristic sentence score from said sentence by applying a set of predetermined heuristic sentence rules to said sentence, each heuristic sentence rule being associated with a sentence rule score; combining said sentence rule scores to obtain said heuristic sentence score; and combining said heuristic sentence score and said sentence statistic to obtain said sentence score.
 16. A method as defined in claim 15, wherein said document summary includes sentences from said document having a sentence score higher than a threshold score, said threshold score being selected so that said document summary is smaller than a predetermined size.
 17. A method as defined in claim 16, wherein said threshold score is selected individually for each of said predetermined themes so that said sentences selected to be part of said document summary for each of said predetermined themes represent a predetermined fraction of said document.
 18. A method as defined in claim 1, further comprising filtering said document to remove words satisfying a predetermined word rejection criterion.
 19. A method as defined in claim 1, wherein summarizing said document includes replacing in said document expressions included in a list of predetermined expressions by respective predetermined abbreviations.
 20. A method as defined in claim 1, further comprising translating said document summary.
 21. A method as defined in claim 20, wherein translating said document summary is performed using translation rules which depend on said specific category.
 22. A method as defined in claim 1, wherein said document is a court judgment.
 23. A computer readable storage medium containing a program element for execution by a computing device, said program element being able to produce a document summary from a document, said document including a plurality of words and being segmentable into a plurality of text segments, each text segment including at least one word, said document being classifiable as belonging to a category selected from a set of predetermined categories and each text segment being classifiable as belonging to a theme selected from a set of predetermined themes, said program element comprising: an input module operative for receiving the document; a categorization module operative for associating with said document a specific category from said set of predetermined categories; a segmentation module operative for performing a thematic segmentation of said document to produce a segmented document, said segmented document including said plurality of text segments; and associating with each text segment from said plurality of text segments a theme selected from said set of predetermined themes; a summarization module operative for summarizing said segmented document to produce said document summary by processing each text segment from said plurality of text segments to either select at least one summary textual unit from said text segment, said at least one summary textual unit including at least one of said words, said at least one summary textual unit being a textual unit considered important in summarizing said document; or extract no textual unit from said text segment; said summary textual units being used to form said document summary; and an output module operative for releasing the summarized document; wherein said thematic segmentation is dependent on said category to which said document is associated and said summary textual units are selected for each text segment depending on said theme with which said text segment is associated.